8 (a) We placed the College.csv file in the Datasets directory. Let us access this file.


In [6]:
college = read.csv("Datasets/College.csv", stringsAsFactors = TRUE) #Read text columns as factors (no longer the default since R 4.0)
head(college) #Use fix(college) in R-Studio to display in internal editor


X                             Private  Apps  Accept  Enroll  Top10perc  Top25perc  F.Undergrad  P.Undergrad  Outstate  Room.Board  Books  Personal  PhD  Terminal  S.F.Ratio  perc.alumni  Expend  Grad.Rate
Abilene Christian University  Yes      1660    1232     721         23         52         2885          537      7440        3300    450      2200   70        78       18.1           12    7041         60
Adelphi University            Yes      2186    1924     512         16         29         2683         1227     12280        6450    750      1500   29        30       12.2           16   10527         56
Adrian College                Yes      1428    1097     336         22         50         1036           99     11250        3750    400      1165   53        66       12.9           30    8735         54
Agnes Scott College           Yes       417     349     137         60         89          510           63     12960        5450    450       875   92        97        7.7           37   19016         59
Alaska Pacific University     Yes       193     146      55         16         44          249          869      7560        4120    800      1500   76        72       11.9            2   10922         15
Albertson College             Yes       587     479     158         38         62          678           41     13500        3335    500       675   67        73        9.4           11    9727         55

I used the head() function to display only the first few rows of the dataset. In RStudio, fix(college) would instead open the full table in the built-in data editor.

NOTE: Not all columns may be visible in print, so let us list the fields in the dataset for reference.


In [9]:
names(college)


  1. 'X'
  2. 'Private'
  3. 'Apps'
  4. 'Accept'
  5. 'Enroll'
  6. 'Top10perc'
  7. 'Top25perc'
  8. 'F.Undergrad'
  9. 'P.Undergrad'
  10. 'Outstate'
  11. 'Room.Board'
  12. 'Books'
  13. 'Personal'
  14. 'PhD'
  15. 'Terminal'
  16. 'S.F.Ratio'
  17. 'perc.alumni'
  18. 'Expend'
  19. 'Grad.Rate'

8 (b) In the table, we do not want the college name to appear as a part of the data. However, this information may be useful later on. We can store them as row names. Let us now check the current row names of this table.


In [7]:
rownames(college)[1:10] #Display the first 10 row names


  1. '1'
  2. '2'
  3. '3'
  4. '4'
  5. '5'
  6. '6'
  7. '7'
  8. '8'
  9. '9'
  10. '10'

rownames() gives us the implicit row names of the table, which default to the row numbers. Notice that these numbers are not displayed in the table shown in 8 (a). Let us replace them with the entries of the first column.


In [8]:
rownames(college) = college[ , 1]
head(college)


                              X                             Private  Apps  Accept  Enroll  Top10perc  Top25perc  F.Undergrad  P.Undergrad  Outstate  Room.Board  Books  Personal  PhD  Terminal  S.F.Ratio  perc.alumni  Expend  Grad.Rate
Abilene Christian University  Abilene Christian University  Yes      1660    1232     721         23         52         2885          537      7440        3300    450      2200   70        78       18.1           12    7041         60
Adelphi University            Adelphi University            Yes      2186    1924     512         16         29         2683         1227     12280        6450    750      1500   29        30       12.2           16   10527         56
Adrian College                Adrian College                Yes      1428    1097     336         22         50         1036           99     11250        3750    400      1165   53        66       12.9           30    8735         54
Agnes Scott College           Agnes Scott College           Yes       417     349     137         60         89          510           63     12960        5450    450       875   92        97        7.7           37   19016         59
Alaska Pacific University     Alaska Pacific University     Yes       193     146      55         16         44          249          869      7560        4120    800      1500   76        72       11.9            2   10922         15
Albertson College             Albertson College             Yes       587     479     158         38         62          678           41     13500        3335    500       675   67        73        9.4           11    9727         55

Perfect! If the above table is viewed in RStudio, we would see the column name row.names above the bold college names. We now remove the column X from the table.


In [10]:
college = college[ , -1] #Exclude first column
head(college)


                              Private  Apps  Accept  Enroll  Top10perc  Top25perc  F.Undergrad  P.Undergrad  Outstate  Room.Board  Books  Personal  PhD  Terminal  S.F.Ratio  perc.alumni  Expend  Grad.Rate
Abilene Christian University  Yes      1660    1232     721         23         52         2885          537      7440        3300    450      2200   70        78       18.1           12    7041         60
Adelphi University            Yes      2186    1924     512         16         29         2683         1227     12280        6450    750      1500   29        30       12.2           16   10527         56
Adrian College                Yes      1428    1097     336         22         50         1036           99     11250        3750    400      1165   53        66       12.9           30    8735         54
Agnes Scott College           Yes       417     349     137         60         89          510           63     12960        5450    450       875   92        97        7.7           37   19016         59
Alaska Pacific University     Yes       193     146      55         16         44          249          869      7560        4120    800      1500   76        72       11.9            2   10922         15
Albertson College             Yes       587     479     158         38         62          678           41     13500        3335    500       675   67        73        9.4           11    9727         55

Note the first column in bold, although not explicitly mentioned, is called row.names. This is not a data column but rather the name that R is giving to each row.
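
As an aside, the two steps in 8 (b) (copying the names into the row names, then dropping the column X) can probably be folded into the read itself. The sketch below assumes the row.names argument of read.csv() and uses a small made-up CSV string (toy colleges and values, not the real College.csv) so that it runs on its own.

```r
# Toy CSV text standing in for College.csv (names and values are made up).
csv_text <- "X,Private,Apps
College A,Yes,100
College B,No,250
College C,Yes,75"

# row.names = 1 asks read.csv to use the first column as the row names,
# so no separate X column is left over to remove afterwards.
toy <- read.csv(text = csv_text, row.names = 1, stringsAsFactors = TRUE)

print(rownames(toy)) # the college names
print(names(toy))    # the remaining data columns
```

With the real file, the equivalent call would be read.csv("Datasets/College.csv", row.names = 1, stringsAsFactors = TRUE).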

8 (c) i. Let us now produce a numerical summary of the dataset.


In [11]:
summary(college)


 Private        Apps           Accept          Enroll       Top10perc    
 No :212   Min.   :   81   Min.   :   72   Min.   :  35   Min.   : 1.00  
 Yes:565   1st Qu.:  776   1st Qu.:  604   1st Qu.: 242   1st Qu.:15.00  
           Median : 1558   Median : 1110   Median : 434   Median :23.00  
           Mean   : 3002   Mean   : 2019   Mean   : 780   Mean   :27.56  
           3rd Qu.: 3624   3rd Qu.: 2424   3rd Qu.: 902   3rd Qu.:35.00  
           Max.   :48094   Max.   :26330   Max.   :6392   Max.   :96.00  
   Top25perc      F.Undergrad     P.Undergrad         Outstate    
 Min.   :  9.0   Min.   :  139   Min.   :    1.0   Min.   : 2340  
 1st Qu.: 41.0   1st Qu.:  992   1st Qu.:   95.0   1st Qu.: 7320  
 Median : 54.0   Median : 1707   Median :  353.0   Median : 9990  
 Mean   : 55.8   Mean   : 3700   Mean   :  855.3   Mean   :10441  
 3rd Qu.: 69.0   3rd Qu.: 4005   3rd Qu.:  967.0   3rd Qu.:12925  
 Max.   :100.0   Max.   :31643   Max.   :21836.0   Max.   :21700  
   Room.Board       Books           Personal         PhD        
 Min.   :1780   Min.   :  96.0   Min.   : 250   Min.   :  8.00  
 1st Qu.:3597   1st Qu.: 470.0   1st Qu.: 850   1st Qu.: 62.00  
 Median :4200   Median : 500.0   Median :1200   Median : 75.00  
 Mean   :4358   Mean   : 549.4   Mean   :1341   Mean   : 72.66  
 3rd Qu.:5050   3rd Qu.: 600.0   3rd Qu.:1700   3rd Qu.: 85.00  
 Max.   :8124   Max.   :2340.0   Max.   :6800   Max.   :103.00  
    Terminal       S.F.Ratio      perc.alumni        Expend     
 Min.   : 24.0   Min.   : 2.50   Min.   : 0.00   Min.   : 3186  
 1st Qu.: 71.0   1st Qu.:11.50   1st Qu.:13.00   1st Qu.: 6751  
 Median : 82.0   Median :13.60   Median :21.00   Median : 8377  
 Mean   : 79.7   Mean   :14.09   Mean   :22.74   Mean   : 9660  
 3rd Qu.: 92.0   3rd Qu.:16.50   3rd Qu.:31.00   3rd Qu.:10830  
 Max.   :100.0   Max.   :39.80   Max.   :64.00   Max.   :56233  
   Grad.Rate     
 Min.   : 10.00  
 1st Qu.: 53.00  
 Median : 65.00  
 Mean   : 65.46  
 3rd Qu.: 78.00  
 Max.   :118.00  

The summary above makes it easy to tell the qualitative (categorical) variables from the quantitative (numeric) ones. The first column, Private, takes only two values, Yes or No, so R reports a count for each level. Every other field is quantitative and is summarised by its minimum, 1st quartile, median, mean, 3rd quartile and maximum.
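
The same split can be checked programmatically rather than by reading the summary. This is a minimal sketch on a made-up three-row table (the name toy and its values are illustrative only); on the real data one would pass college instead.

```r
# Toy table with one factor column and two numeric columns (made-up values).
toy <- data.frame(
  Private   = factor(c("Yes", "No", "Yes")),
  Apps      = c(100, 250, 75),
  Grad.Rate = c(60, 70, 54)
)

# sapply() applies the test to every column and returns a named logical vector.
qualitative  <- names(toy)[sapply(toy, is.factor)]
quantitative <- names(toy)[sapply(toy, is.numeric)]

print(qualitative)
print(quantitative)
```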

8 (c) ii. We now display the relationship between the first 10 columns using a scatterplot matrix.


In [12]:
pairs(college[,1:10])


8 (c) iii. Let us produce side-by-side boxplots of Outstate versus Private.


In [13]:
plot(college$Private, college$Outstate, xlab="Public/Private Indicator", ylab="Out of State Tuition($)", main="Boxplot of Outstate Vs. Private")
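
A boxplot compares groups mainly through their medians and quartiles, so a numerical counterpart can be useful next to the picture. The sketch below computes the per-group medians with tapply() on a made-up six-row table (the values are illustrative, not taken from College.csv).

```r
# Toy data: three private and three public colleges with made-up tuition.
toy <- data.frame(
  Private  = factor(c("Yes", "Yes", "Yes", "No", "No", "No")),
  Outstate = c(12000, 9000, 15000, 6000, 7000, 5000)
)

# Median tuition within each level of Private (the thick line in each box).
group_medians <- tapply(toy$Outstate, toy$Private, median)
print(group_medians)

# A formula-style call that draws the same kind of plot (not run here):
# boxplot(Outstate ~ Private, data = toy)
```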


8 (c) iv. Let us add a new categorical field called Elite which takes 2 values depending on Top10perc:

  • Yes: when more than 50% of a college's new students come from the top 10% of their high school class.
  • No: when 50% or fewer of a college's new students come from the top 10% of their high school class.

This is implemented by initializing this field for every college as No, then applying the condition.


In [14]:
Elite = rep("No", nrow(college)) #Initialize all entries of Elite to 'No'
Elite[college$Top10perc > 50] = "Yes" #If Top10perc > 50, assign field as 'Yes'
Elite = as.factor(Elite)
college = data.frame(college, Elite)

In [15]:
head(college)


                              Private  Apps  Accept  Enroll  Top10perc  Top25perc  F.Undergrad  P.Undergrad  Outstate  Room.Board  Books  Personal  PhD  Terminal  S.F.Ratio  perc.alumni  Expend  Grad.Rate  Elite
Abilene Christian University  Yes      1660    1232     721         23         52         2885          537      7440        3300    450      2200   70        78       18.1           12    7041         60  No
Adelphi University            Yes      2186    1924     512         16         29         2683         1227     12280        6450    750      1500   29        30       12.2           16   10527         56  No
Adrian College                Yes      1428    1097     336         22         50         1036           99     11250        3750    400      1165   53        66       12.9           30    8735         54  No
Agnes Scott College           Yes       417     349     137         60         89          510           63     12960        5450    450       875   92        97        7.7           37   19016         59  Yes
Alaska Pacific University     Yes       193     146      55         16         44          249          869      7560        4120    800      1500   76        72       11.9            2   10922         15  No
Albertson College             Yes       587     479     158         38         62          678           41     13500        3335    500       675   67        73        9.4           11    9727         55  No

NOTE 2: If the code above is executed more than once, we end up with several copies of the field Elite (I once ended up with 4). At this point the table should have 19 columns (the 18 original variables plus Elite), so the extra copies can be dropped with the command: college = college[ , 1:19]
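
One way to sidestep the duplicate columns altogether is to assign the new field by name, so that re-running the cell overwrites Elite instead of appending another copy. The sketch below is an alternative to the data.frame() call used above, not the approach of the original cell; it uses a made-up four-row table and repeats the assignment three times to mimic re-running the cell.

```r
# Toy table with made-up Top10perc values, including the boundary case 50.
toy <- data.frame(Top10perc = c(23, 60, 50, 89))

# Simulate running the same cell three times.
for (run in 1:3) {
  # ifelse() builds the Yes/No labels in one step; assigning to toy$Elite
  # replaces the column if it already exists.
  toy$Elite <- factor(ifelse(toy$Top10perc > 50, "Yes", "No"))
}

print(names(toy))      # still only one Elite column
print(table(toy$Elite))
```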

Now that the Elite field has been appended as the last column, let us see how many such Elite universities exist.


In [16]:
summary(college$Elite)


No
699
Yes
78

So of our 777 colleges, 78 are Elite. The boxplot below shows Out of State tuition for elite and non-elite universities.


In [17]:
plot(college$Elite, college$Outstate, ylab="Out of State Tuition ($)", xlab="Is the University Elite?",main="Elite Vs. Outstate")


8 (c) v. We will create histograms of some quantitative variables. For each variable, we pass every value from 1 to 20 as the breaks argument of hist() to see how the requested number of bins changes the plot. To show several of these histograms on one page, we use par(). Let us start with Out of State Tuition.


In [18]:
par(mfrow=c(2,2))
for (numBins in 1:20){
    hist(college$Outstate, breaks=numBins, xlab="OutState", ylab="Freq", main=paste("Bins = ",numBins))
}


Interesting Observation: Although the number of break points runs from 1 to 20 for the field Outstate, many of the histograms are identical. Every value of numBins from 7 to 13 produces the same 10-bin histogram, every value from 14 to 20 produces the same 20-bin histogram, and every value from 3 to 6 produces the same 5-bin histogram. This happens because hist() treats a single number passed to breaks as a suggestion only and rounds the break points to "pretty" values.
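
The number of bins actually used can be read off without looking at the plots. The sketch below relies on my understanding that, when breaks is a single number, hist() chooses the break points from the range of the data alone; it therefore feeds in just the minimum and maximum of Outstate reported by summary() above, and uses plot = FALSE so nothing is drawn.

```r
# Minimum and maximum of Outstate, copied from the summary() output above.
outstate_range <- c(2340, 21700)

# For each requested value of breaks, count the bins hist() really creates.
requested <- 1:20
actual_bins <- sapply(requested, function(k) {
  h <- hist(outstate_range, breaks = k, plot = FALSE)
  length(h$breaks) - 1 # n break points delimit n - 1 bins
})

print(data.frame(requested, actual_bins))
```

Running the same loop on college$Outstate itself should give the same counts, since only the range matters for the break points under this assumption.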

Let us check the same for another quantitative field: Book Cost.


In [19]:
par(mfrow=c(2,2))
for (numBins in 1:20){
    hist(college$Books, breaks=numBins, xlab="Book Cost", ylab="Freq", main=paste("Bins = ",numBins))
}


We observe similar histograms when the number of bins was set from:

  • 4 through 8
  • 9 through 16
  • 17 through 20

Let us plot the graphs in different ways, varying the par arguments.


In [22]:
par(mfrow=c(5,2))
for (numBins in 1:20){
    hist(college$Books, breaks=numBins, xlab="Book Cost", ylab="Freq", main=paste("Bins = ",numBins))
}


Observation: The 5 x 2 grid of histograms appears to be drawn in the same space that the 2 x 2 grid occupied before, so each individual histogram is much smaller.
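
The effect of the grid size on the number of pages can be worked out directly. The sketch below is illustrative only: pages_for is a hypothetical helper, and pdf(NULL) is used as an off-screen device so that par() can be called without opening a window.

```r
pdf(NULL) # off-screen graphics device; no file is written

n_plots <- 20 # one histogram per value of numBins

# Hypothetical helper: pages needed for n_plots panels on a rows x cols grid.
pages_for <- function(rows, cols) {
  old <- par(mfrow = c(rows, cols)) # par() returns the previous settings
  on.exit(par(old))                 # restore them when the function exits
  ceiling(n_plots / prod(par("mfrow")))
}

pages_2x2 <- pages_for(2, 2)
pages_5x2 <- pages_for(5, 2)
invisible(dev.off())

print(c(pages_2x2 = pages_2x2, pages_5x2 = pages_5x2))
```

Saving and restoring the old par() settings, as done inside the helper, also keeps a layout change in one cell from leaking into later plots.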

8 (c) vi. SEE OBSERVATIONS AND NOTES IN THE ANSWERS ABOVE

Additionally, from the scatterplot matrix in 8 (c) ii., we observe nearly linear relationships between some pairs of variables, such as F.Undergrad vs. Enroll and Top10perc vs. Top25perc. When modelling this data, it may be enough to retain only one covariate (feature) from each group of highly correlated variables.
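
A visual impression of linearity can be backed up with a correlation coefficient. The sketch below uses only the six rows shown by head() earlier, so the number it prints is illustrative and not the correlation for the full dataset; on the real data one would call cor(college$F.Undergrad, college$Enroll).

```r
# F.Undergrad and Enroll for the six colleges displayed by head(college).
first_six <- data.frame(
  F.Undergrad = c(2885, 2683, 1036, 510, 249, 678),
  Enroll      = c(721, 512, 336, 137, 55, 158)
)

# Pearson correlation; values near 1 indicate a strong linear relationship.
r <- cor(first_six$F.Undergrad, first_six$Enroll)
print(r)
```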